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Progress Summary 

During the period December 23, 1997 and December August 31, 2004, we accomplished 
the development of 2 CFD codes for DNS/LES/RANS simulation of turbine cascade 
flows, namely LESTool and UNCLE. LESTool is a structured code making use of 5 th 


order upwind differencing scheme and UNCLE is a second-order-accuracy unstructured 


code. LESTool has both Dynamic SGS and Sparlart's DES models and UNCLE makes 
use of URANS and DES models. The current report provides a description of 
methodologies used in the codes. 

1. Introduction 

Flow transition plays an important role in turbomachinery applications. The majority of 
boundary layer flows in turbomachines involve flow transition under the effects of 
freestream turbulence, diverse pressure gradients, wide range of Reynolds numbers, flow 
separation, and unsteady wake-boundary layer interactions. 

Prediction of this type of complex flows is an important element in analysis and 
performance evaluation of gas turbine engine components and ultimately in the design of 
more efficient jet engines. Especially, in low pressure turbine applications prediction of 
transition becomes pivotal in terms of efficiency. For low pressure turbines the flow is 
mostly turbulent at the high Reynolds number conditions encountered at take off and the 
efficiency is at its design maximum. However, at high altitudes and cruise speeds which 
correspond to lower Reynolds number conditions, unpredicted losses and substantial 
drops in efficiency have been observed. These losses are attributed to flow separation on 
the suction surface of the turbine blades. At low Reynolds numbers, the boundary layers 
on the airfoil surface have a tendency to remain laminar and hence the flow may separate 
before it becomes turbulent, causing increase in fuel consumption and drop in efficiency. 
The impact of such losses is directly felt on the operation costs. It has been estimated that 
a 1% improvement in the efficiency of a low pressure turbine would result in a saving of 
$52,000 per year on a ty pical airliner. 

In order to calculate the losses and heat transfer on various components of gas turbine 
engines, and to be able to improve component efficiencies and reduce losses through 
better designs, accurate prediction of transitional boundary layers is essential. When one 
deals with a complex fluid phenomena like a transition, separation and turbulence, 
several hundred millions grid points are needed to resolve boundary layers and other flow 
structures correctly. We have started to develop technology to make such large scale 
simulations not only possible at supercomputing centers like NCSA or NAS but on 
inexpensive, high-performance clusters of PCs, or “Beowulfs". These clusters are 
specialized for CFD applications, using the novel approach that the hardware, operating 
system, and application code are optimized together rather than separately. A Honorable 


Mention in the Price/Performance Category of the Gordon Bell Prize was awarded for 
this approach at 1EEE/ACM SC2000 Conference on High-Performance Networking and 
Computing. 

Several turbulence test cases have been computed and an overview of the results is given. 

2. Code Descriptions 

(1) A description of LESTool is give in Appendix 1. 

(2) A description of UNCLE is given in Appendix 2. 

3. Publications by the PI 
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(2) Th. Hauser, R. LeBeau, T. Mattox, P. Huang and H. Dietz. 

Improving the performance of Computational Fluid Dynamics codes on Linux Cluster 
Architectures. 16th AIAA Computational Fluid Dynamics Conference. Orlando, Florida, 
June 23-26, 2003. AIAA. 

(3) Th. Hauser, T. Mattox, R. LeBeau, H. Dietz and P. Huang. Scrutinizing CFD 
performance on multiple Linux cluster architectures. Presentation at the Clusterworld 
Conference & Expo, June 23-26, San Jose, California, 2003. 

(4) Th. Hauser, T. Mattox, R. LeBeau, H. Dietz and P. Huang. A comparative study of 
the performance of a CFD program across different Linux cluster architectures. 
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FL, 2002. 

(5) Hauser, T., R. LeBeau, T. Mattox, H. Dietz and P. Huang. High-Cost CFD on a Low- 
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(6) Th. Hauser, T. Mattox, R. LeBeau, H. Dietz and P. Huang. High-Cost CFD on a 
Low-cost Cluster. Regular paper at SC2000 and honorable mention for the Gordon Bell 
award in the category "price-performance". 2000. 

(7) Th. Hauser and P. Huang. Numerical simulation of turbulent spots. 25th Annual 
Dayton-Cincinnati Aerospace Science Symposium, 2000. 

(8) Th. Hauser and P. Huang. A hierarchical parallelization concept for a high- 
performance Navier-Stokes solver. Proceedings of the International Conference on 
Parallel and Distributed Processing Techniques and Applications (PDPTA'99). 1999. 

(9) Th. Hauser and P. Huang. Large eddy simulation of low pressure turbine flow. 24th 
Annual Dayton-Cincinnati Aerospace Science Symposium, 1999. 

(10) Th. Hauser and P. Huang. Shared Memory Parallelization of an implicit ADI-type 
CFD code. In C. Lin and et. al, editors. Parallel computational fluid dynamics, 
evelopment and applications of parallel technology, proceedings of the Parallel CFD'98 
Conference, pages 145-152. 1999. Elsevier Science B.B. 
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4. Award received by the PI 

A Honorable Mention in the Price/Performance Category of the Gordon Bell Prize was 
awarded for this approach at IEEE/ACM SC2000 Conference on High-Performance 
Networking and Computing \cite{hauser2000}. 


Appendix I - LESTool 





Chapter 1 

Introduction 


Flow transition plays an important role in tuibomachinery applications. Tlie major- 
ity of boundary layer flows in tuibomachines involve flow transition under the ef- 
fects of freestream turbulence, diverse pressure gradients, wide range of Reynolds 
numbers, flow separation, and unsteady wake-boundary layer interactions. 

Prediction of this type of complex flows is an important element in analysis and 
performance evaluation of gas turbine engine components and ultimately in the de- 
sign of more efficient jet engines. Especially, in low pressure turbine applications 
prediction of transition becomes pivotal in terms of efficiency. For low pressure 
turbines the flow is mostly turbulent at the high Reynolds number conditions en- 
countered at rake off and the efficiency is at its design maximum. However, at 
high altitudes and cruise speeds which correspond to lower Reynolds number con- 
ditions. unpredicted losses and substantial drops in efficiency have been observed 
f[?l [?], [?]), These losses are attributed to flow separation on the suction surface 
of the turbine blades. At low Reynolds numbers, the boundary layers on the airfoil 
surface have a tendency to remain laminar aud hence the flow may separate before 
it becomes tiubulent, causing increase in fuel consumption and drop in efficiency. 
Tire impact of such losses is directly felt on the operation costs. It has been es- 
timated that a 1% improvement in the efficiency of a low pressure turbine would 
result in a saving of $52, 000 per year on a typical airliner ([?]). 

In cider to calculate tire losses and heat transfer on various components of 
gas turbine engines, and to be able to improve component efficiencies and reduce 
losses through better designs, accurate prediction of transitional boundary layers 
is essential ({?]). When one deals with a complex fluid phenomena like a tran- 
sition. separation and turbulence, several hundred millions grid points are needed 
to resolve boundary layers and other flow structures correctly. We have started to 
develop technology to make such large scale simulations not only possible at su- 
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percomputing centers like NCSA or NAS but on inexpensive, high-performance 
clusters of PCs, or “Beowulfs" [?]. These clusters are specialized for CFD appli- 
cations. using the novel approach that the hardware, operating system, and appli- 
cation code are optimized together rather than separately. A Honorable Mention in 
the Pri ce/Per form an oe Category of the Gordon Bell Prize was awarded for this ap- 
proach at IEEE' ACM SC2000 Conference on High-Performance Networking and 
Computing [?]. 

Several turbulence tes teases have been computed and an overview of the results 
is given. 



Chapter 2 

Turbulence Models 


2.1 LES 


Filtered equations 


Large-eddy simulation (LES) is based on tbe definition of a filtering operation: a 
filtered for large-scale, or resolved) quantity, denoted by an overbad is defined as 

7(x) = j f(yf )G(x. X )dx , (2.1) 

where D is the entire domaiu and G is the filter function, which determines the 
size and structure of the small scales. For compressible flow the large-scale equa- 
tions are operationally simpler if Favre filtered quantities are used. A Fane filtered 
variable is defined as j — pfij*. Applying the spatial filter G to the governing 
equations leads to 
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4. This term is similar to the heat flux and can be modelled accordingly 

5. This term is neglected because of the same reasons as 2 

6. This term is the viscous SGS work and is neglected 

To summarize the following quantities will be be modeled in our implementation: 

1. SGS stresses r^ GS 

2. SGS heat fl ux q? ,jr * 

2.1.1 Smagrosinsky Turbulence Model 

2.1.2 Dynamic SGS model 

For the left-hand side a trace-free Smagorinsky eddy \ascositv model is used for 
However the eddy viscosity coefficient will be a function of the instantan- 
tous ft ow vari ables. Let 


The isotropic part of the SGS Reynolds stress tensoT q 1 — 7h- has to be mod- 
eled seperately. It is parametnzed using Yoshizawa s [?], [?] expression: 


Germane et. al. [?] showed that expressions similar to the ones multiplying 
Q can become zero. Therefore an averaging procedure is needed to make the 
determinates of the SGS coefficients well conditioned. It is assuired that Cs is 



( 2 . 12 ) 


where ?*- is the trace-free rate of the strain tensor 



(2.13) 


and S | is the norm of 5„, 





(2.14) 


<r — 2C'., pA : |S : 

To confute C-. ilie trace of eq. ?? together with 2.12 is used: 



(2. 16) 
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independent of the directions in which the flow is homogeneous. Volume averaging 
leads to 



To obtain C we also use the model of eq. 2.12 for the test field stresses 



(2. IS) 

This leads to 


L.., = iLvA* - 2C?A ! ll|I* - 2CA 2 p\S\Sr 
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with M %? 


M ij =p\§\st : -fiU) \§\K 

\ ^ y 

(’2.21) 

Contracting eq. ?? with A% which was recommended by Lilly is outlined 
below. Since eq. ?? represents six independent equations in one unknown, its error 
can be minimized by applying a lesat squares approach. Define G as 

G = (La - ~ 2 CM af 

(2.22) 


By setting dQ/dC = 0, C is evaluated as 

^ r> 1 < Uy iyf'ly — \LiliMkk ^ 

C ^ = 2 <M,My> 
where <> denotes a spatial averaging operation. 


2.2 DES model 

The DES turbulence model isba.sed on the blending of R AN'S and LES to provide a 
model that can accurately predict unsteady flows without requiring high-precision 
grids near the walls. A RANS turbulence model is used to simulate the turbulence 
within the boundary layer, gradually subsuming into the LES model as one moves 
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away from the wall. Further details, on this technique can be found in [?}. [?] and 
others. The RANS tuibulence model used in LESTool is the one-equation Spalart- 
Allmaras model. 


with the coefficents given below 

ea = 0.1355, = 0.622, Cvi = 7.1, C& = 5.0, t * i = %\j& + 

= 0.3. <5*s = 2, <r = 2/3, k = 0.41, Q = |V x it | , 
and d is the distance to the nearest wall. Note that this formulation does not include 
the trip functions. Large values of r can be truncated at 10 or so. Hie turbulent 
viscosity for the momentum equations is found from v t = . For the DES 

approach the length scale d is changed into 


with Ci = 0.65. This gives the turbulence model behaviour near the wall and 
LES type behavior away from the wall 



(2.24} 


where 


A = c bi Asfi - F 




(2 .26) 



d* = nun(d, max(Ajr. Ay. Az)C x ) 


(2.2S) 
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(D 1 + ML^dtSU*) = EC [(«7 l ) m -(TO 9 ) m “ 1 ' (3.5) 

(P 1 +Afi|)j(5P 3 .) = d 1 [i6V‘r--au z y-- i ' <3.e» 

This algorithm is easily parallelizable because we loop trough a three dimen- 
sional array plane by plane. The outline of the algorithm is the following: 

1 for each k: solve equations (2) and (3) in independent i-j planes 

2. for each j : solve equations (4 ) and (2) in independent k-i planes 

3. for each i: solve equations (3) and (4) in independent j-k planes 

This means after completion of the above algorithm we have already completed 
two iterations. From our experience three to four iterations are needed foT each 
time-step to reach the required accuracy. 

Fortran 90 was chosen as the programming language for this project because 
of the new powerful array syntax, the addition of derived t\pes and the module 
concept. The code has been developed based on an object-oriented methodology 
to simplify the transition from tire Cartesian implementation to an implementa- 
tion for general coordinates and chimera- type overlapping grids. The new features 
of Fortran 90 like abstract data npes or generic programming support an object- 
oriented programming style even though it is not designed to be an object-oriented 
language. 



Chapter 3 


Numerical method 


The compressible Xavier -Stokes equations are discretized using an iterative diago- 
nal dominant ADI (DD-ADI) algorithm that can be written in the following form: 


The right-hand side is approximated by the newly develop EN0-p3de method 
[?] for the convective flux and sixth-order central for the diffusive terms. The basic 
algorithm is the following. First the flux at each cell face is computed as shown 
below 


Quantities with the superscript H are computed using a Sth -order central in- 
terpolation method. For the quantities with the superscript R or L the standard 
EX O-interpolation [?]. Than the flux is differentiated using a cell centered Pade 
formula [?]: 


The left-hand side is approximated by a lower-order method. The high order 
solution can be achieved by performing inner iterations since the order of accuracy 
is not affected by low order treatment of the left-hand side. 

The proposed algorithm has the following form: 


(B L +AiLi)diSU'- i = D l [(W 3 )” -1 - (W 1 )’” -1 ] (3.4t 



F..+- r . = FiQ; 
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Chapter 4 

Parallelization 


41 Shared Memory parallel approach 

The LESTool code was parallelized in a straightforward manner: The code was 
compiled with the -mp option to enable parallelism in the compile phase. This was 
combined with placing OpenMP compiler directives, w hich instruct the compiler 
to generate code that will execute in parallel. 

These directives were placed at key spots within the code for the greatest par- 
allel efficiency. This resulted in a decomposing of the 3D problem into groups of 
ID lines, with each group assigned to a dedicated processor. 

The efficiency of the parallel decomposition was enhanced by the use of the 
first touch policy which is specific to the SGI Origin 2000. This means that mem- 
ory allocated is physically placed on the processor which touches the memory lo- 
cation first. This means that all large three dimensional blocks of memory where 
initialized in planes of constant k to get a memory distribution which is favorable 
for our algorithmic design (see figure 4.1). 

Since we target shared memory computers there is no explicit data redistribu- 
tion ueeded. Instead we split up the ^-sweeps among the processors by partitioning 
work along the y - axis (fig. 4 2). This is much simpler and easier to implement 
compared to the distributed memory concept. 


4.2 Distributed memory approach 

In order to achiever better portability and performance on a wider range of com- 
putational platforms a distributed approach was also implemented. Tire parallel 
structure of the distributed version of LESTool is based on splitting the compu- 
tational grid into sub-blocks, which are then distributed to each processor (figure 
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Parallelization 


4.1 Shared Memory parallel approach 

The LESTool code was parallelized in a straightforward manner: The code was 
compiled with the -mp option to enable parallelism in tlie compile phase. This was 
combined with placing OpenMP compiler directives, which instruct the compiler 
to generate code that will execute in parallel. 

These directives were placed at key spots within the code for the greatest par- 
allel efficiency. This resulted in a decomposing of tlie 3D problem into groups of 
ID lines, with each group assigned to a dedicated processor. 

Tlie efficiency of the parallel decomposition was enhanced by the use of tlie 
first touch policy which is specific to the SGI Origin 2000. This means that mem- 
ory allocated is physically placed on the processor which touches the memory lo- 
cution first. This means that all large three dimensional blocks of memory where 
initialized in planes of constant k to get a memory distribution which is favorable 
for our algorithmic design (see figure 4.1). 

Since we target shared memory computers there is no explicit data redistribu- 
tion needed. Instead we split up the ^-sweeps among tire processors by partitioning 
work along the t^axis (fig. 42). This is much simpler and easier to implement 
compared to the distributed memory concept. 


4.2 Distributed memory approach 

In order to achiever better portability and performance on a wider range of com- 
putational platforms a distributed approach w'as also implemented. Hie parallel 
structure of the distributed version of LESTool is based on splitting the compu- 
tational grid into sub-blocks, which are then distributed to each processor (figure 
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43 1 For a complex geometry, this is a nontrivial exercise where an uneven load- 
bala nee across different processors could significantly reduce the computational 
• efficiency of the overall code. The partitioning of the grid into sub-blocks is per- 
formed independently of the grid generation, using virtual partitions to define the 
domains corresponding to each processor. The current algorithm governing the par- 
tition process is based on spectral recursive bisection in conjunction with a small 
database of the computational speeds associated with each node. The splitting is 
performed as a preprocessing task, generating a file that maps the grid sub-blocks 
onto the processors. This file is die only difference between performing a seiial 
computation and a parallel confutation on a distributed machine. 

Communications between the grid sub-blocks occurs when the sub-blocks ex- 
change data about the flow variables at the boundaries. As show in figure 43 . the 
flow variables on the edge of one grid block are communicated to the dummy points 
of the neighboring grid block, and vice versa. LESTool requires such a communi- 
cation step after each update, or subiteration, of the flow variables. The low -level 
implementation of the communication between the sub-blocks uses a MPI-based 
communications system. The communication model is a mailbox algorithm where 
the data is sent in a non-blocking communication mode as early as possible. 
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Chapter 5 


Accuracy and Performance of 
LESTooi 


5.1 Accuracy of the ENO-Pade method 

In a previous proposal a fifth -order upwind scheme has been proposed to simulate 
flow around a low pressure tuibine. This method showed to much numerical damp- 
ening so that the transition and separation on the low-pressure turbine couldn't be 
simulated. In order to develop a code suitable for transition and turbulence a new 
numerical method - the ENOPade method - was developed [?]. This method com- 
bines the stability of the ENO scheme with the high-order accuracy of tire Pade 
method. Several testcases have been computed and die damping of the new scheme 
is much lower as you can see below’ from one test case. This case consists of the 
unsteady advection of a vortex in a uniform flow* Here the capabilities of the differ- 
ent numerical schemes to accurately advect the vortex structures are tested which 
is critical to LESDNS simulations. 


5.2 Performance Results 

5.2.1 Single processor performance tuning 

The key point on cache based architectures to achieve high floating point perfor- 
mance is cache-awareness of the algorithm. This stage of tuning is very important 
because efficient utilization of the cache architecture assures good overall perfor- 
mance. Single processor performance is the basic step to get good overall perfor- 
mance for the parallel computation. 

During the development of the code special care was taken by placing variables 
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Figure 5.1: Contour lines of vorticity magnitude at three time intervals of ENOPade 
method 



Figure 52: Swiil velocities of the vortex along y=0 at T = 12 
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which are used concurrently in the calculation close together in the main memory 
One major detail is that the first index in the data arrays is used for addressing the 
different components of global variables, such as conservative variables. This is 
contrary to a vector architecture,, where the index for a variable is normally ad- 
dressed by the last index in an array. The data is copied into one-dimensional 
scratch arrays which fit into the second level cache. The main algorithm is per- 
formed on these scratch arrays which have a memory layout and access pattern 
(stride one) optimal for the cache architecture. This measure alone resulted in a 
floating-point performance of 60 MFLOPS. 

To achieve further speedup of the code, the main bottle necks identified were 
the tridiagonal and block-tridiagonal matrix solvers. To increase the utilization 
of memory bandwidth, all arithmetic operations involving the 5x5 block matrices, 
used in block tridiagonal and periodic block mdiagonal solvers, were unrolled. 
This resulted in a much longer and less desirable code, but the performance results 
outweigh this disadvantage. The LESTool code achieves a floating-point perfor- 
mance of about 120 MFLOP son a ISO MH 2 Origin 2000. We consider this a very 
satisfactory performance . 

Single processor performance 

Single processor run times for a single test case sufficiently large so that it does 
not fit in the L2 cache of all our CPUs has been chosen. The runtime of the code / 
time step is reported in table 5.1. The LosLobos results 1 733 MHz Pentium III) are 
much worse than the Athlon results - this is probably related to (2 computations 2 
data transfers per cycle). However, we are not absolutely confident that this is the 
optimal possible performance of LESTool on LosLobos, and as such subsequent 
results will emphasize the KLAT2 and K.FC1 outcomes. 

Table 5.1: Execution time in seconds on different Linux platforms for a single time 
step for the 64 : test case 


i 

Compiler 

double precision 

cost processor 

$/TSpS 

TOO MHz Athlon 

gcc-3 2 

37s 

$625 

3030 

733 MHz Pentium III 

egcs-2.91.66 

123s 

$2,930 

IS 170 

1 1.4 GHZ Athlon MP 

gee -2. 96 

22s 

$750 

2240 
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5.2.2 OpetiMP performance tuning 

In figure 5.3, the same test case with the improved tine-grain parallelization is 
shown. The removal of the barriers gives a much better speedup and provides a 
better performance of the fine-grain parallelization. The program scales betteT for 
larger number of shared processors. However, there still seems to have a limit 
on the number of the processor used feu the fine-grain parallelization. Since the 
memory is distributed in planes of constant k over the available processors using 
more than 32 processors is no longer advantageous. By choosing 3 much larger 
test case of 256* grid points, a? also shown in figure 5.3 by the dashed line, it can 
be seen that because of a larger computational load per processor the performance 
scales much better. For this case, the program performs well up to 64 processors. 



Figure 5.3: Speedup with Teduced synchronization and orphaned OpenMP direc- 
tives 
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5.2.3 Parallel performance 

In order to achive a reasonable throughput the wall time for our computations with 
LESTool must be at most on the order of 10s / time step. A compaiision of the wall 
times for the three different test cases follows 

Comparison of wall times 

In figure 5 .4 the wall times are compared for KLAT2 and KFC1. The performance 
of KFC1 is obviously better than that of KLAT2 because of the higher clock fre- 
quency. The clock frequency ofKFCl is double than that of KLAT2 but the speed 
of LESTool on KFC1 is obviously not twice compared to that on KLAT2. Here, 
the memory system plays an important role as well-KFC 1 has a PC2 1000 memory 
system, whereas KLAT2 has a PC 100 memory sytem. Our goal to reach a time 
of less than 10s / time step is already reached on two processors on KFC1, while 
for KLAT2 it takes about six processors to come into the same total performance 
range 



Figure 5.4: Wall clock time on KLAT2 and KCF1 for the 64 3 case 

The 128 3 case is much more demanding. In this case it takes about ten nodes 
or twenty processors on our dual processor machine KFC1 to obtain a wall clock 
time under 10s. For KLAT2 forty processors are needed to obtain the desired 
computational speed. Note that for this example KFC1 does achieve twice the 
speed of KLAT2. 

The 196 3 case severely stresses our cluster computers. Here we cannot reach 
the 10s mark on either of the clusters. For KFC1, we can achieve less than 20s / 
time step on 36 processors, or twice the desire speed. With KLAT2 the best wall 
time achievable is 22.7s. 
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Speedup on different network architectures 

KFC1 has a three-way channel-bonded network as described in section ??. The 
code scales well for all three test cases as shown in figure 5.7. Note that we could 
run the single processor version for the 128 3 but the performance was so low be- 
cause of swapping that we derided to scale the larger cases to the minimum number 
of processors on which the case could resonable run. Note that the sizeable drop 
in performance for 34 processors in all three cases. This is related to poor load 
balance. The only way we can divide our grid is in two slices in j- direction and 
seventeen slices in the k-direction. Given the cubical symmetry of this problem, 
the best subdivision is a equal number of cuts in the i, j, k direction. An example 
for an optimal partitioning is the case for 16 processors. Each subblock consists 
of 16 3 grid points. This example shows the importance of the partitioning on the 
parallel performance of LESTool. 



Figure 5.7: Speedup on KFC1 for the 64 3 , 128 3 and the 196 3 cases 

KLAT2 has a FNN network architecture which is optimized for next neighbor 
communication as described in section ??. LESTool scales even better on KLAT2 
than it does on KFC1 for all three test cases as shown in figure 5.8. Note that a 
similar effect caused by uneven load splitting can be seen on this machine as well. 
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Chapter 6 

DNS of homogeneous turbulence 


6.1 Creation of initial conditions 
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Chapter 7 

DNS of turbulent channel flow 


The results presented in this paper were based on a direct numerical simulation 
of a fully developed channel flow reported in [?] The flow has two periodic 
(x and z) directions and the solid wall condition was imposed in the y direc- 
tion. The Reynolds number based on the wall shear velocity, = V T WA 
Re- — V* = ISO, where H is half the channel width. Hie streamwise 

and span wise dimensions of the channel are 4r H and 2r H. respectively. The 
computation is carried out on a grid using 200x121x200 grid points in x , y, and 
z 3 respectively. The flow field is initialized with a laminar solution and random 
fluctuations aTe superimposed on the pressure field. The governing equations are 
then integrated in time until a statistical equilibrium is reached (tu r /H > W). 

Figure 7.1 shows the dimensionless velocity. = u'u- 3 plotred against di- 
mensionless wall distance, = y’u—ji/. Also shown in this figure is the com- 
parison of the predicted mean velocity profile against the data cited in [?] for 
the same Reynolds number. The dotted line represents the desirable profile in 
the viscous sublayer and the dashed line denotes the log-law line, 

= lnu/~ )/0.41 + 5.2. The figure show's that the current mean velocity pro- 
file matches the data of Kim et al. [?] very well. The comparison of the rms values 
as shown in figure 7.2 of the velocity show's good agreement with the data of Kim 
et al. 
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Figure 7.2: Comparison of RMS values 
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Figure 9.2: 3D view of the grid 
9.3. 

9.1 Re = 25000 

9.2 Re = 50000 


In figure 9.4 the averaged solution in a plane is shown. 
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Figure 9.3: Contour surface of initial velocity disturbance 



Figure 9.4: Contour surface of initial velocity disturbance 




18 Inoue, O. and Yamazaki, T. s “Secondary vortex 
streets in two-dimensional cylinder wakes,” Fluid 
Dynamics Research, vol. 25, 1999, pp. 1-18. 




Fig. 3(a) Partitioned triangular mesh for 2D flow over a 
circular cylinder 


(a) 



m 

(b) 

Fig. 1 Schematic diagrams for integration areas, (a) 
convective fluxes, and (b) diffusive fluxes. 


I 




Fig. 2 A schematic diagram of cell-centered partitioning 
approach 


Fig. 3(b) Load-balance distribution on each node in 
parallel computation 



Fig. 3(c) Partitioned tetrahedral mesh for 3D flow over 
a circular cylinder 



Fig. 4 Schematic diagram of laminar incompressible 
flow past flat plate 






Fig. 8(b) The v-velocity contour plot for two- 
dimensional driven cavity flow at Re=400. 



Figure 9(b) The v-velocity profile at the vertical 
centerline of z=0.5 plane with present results, Ku’s and 
Shu’s results. 
























Fig. 12 C L and C D history plots of two-dimensional 
simulation for (a) Re=100, (b) Re=200, (c) Re=300, and 
(d) Re- 1000. 
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Abstract 

A center pressure based method is presented in 
this paper, and which has been implemented into a new 
two/three-dimensional parallel unstructured CFD code, 
UNCLE, which is developed at the University of 
Kentucky to meet the challenges of physical problems 
with complex geometries and complicated boundary 
conditions while maintaining high computational 
efficiency. Good load balancing across computational 
nodes is achieved by using METIS. In order to 
demonstrate the accuracy and performance of center 
pressure based method, several test cases are presented 
for validation such as two-dimensional incompressible 
flow past a flat plate, two/three- dimensional driven 
cavity flow, and two/three- dimensional flow over a 
circular cylinder. Notably, an extensive qualitative and 
quantitative study of two-dimensional flow over a 
circular cylinder for low Reynolds number is also 
presented in this paper. 

1 Introduction 

Continual improvements in computer 
technologies and computational fluid dynamics (CFD) 
algorithms have established CFD codes as a reliable 
tool for fundamental research or industrial applications. 
To deal with increasing grid sizes and demands for 
faster output, parallel computation of CFD has become 
a standard approach. To deal with the different 
challenge presented by some physical problems with 
complex geometries and complicated boundary 
conditions is now approached through unstructured 
CFD grids due to their ability to smoothly conform to 
complicated boundaries. However, combining 
unstructured grids with a parallel code presents still 
other challenges, such as achieving well-balanced grid 
decomposition on a distributed system and efficient 
parallel performance. In order to meet these challenges, 
a center pressure based method has been implemented 
into a new parallel unstructured CFD code called 
UNCLE, which has been developed at the University of 


Kentucky. UNCLE is designed to meet the challenges 
of using unstructured grid codes on high-performance 
parallel computers. It is a two/three-dimensional finite 
volume unsteady incompressible Navier-Stokes solver 
with center pressure based SIMPLE algorithm with 
second order accuracy in both time and space. To 
increase flexibility in complex geometries, center 
pressure based method is extended to use a variety of 
grid types, such as triangular, quadrilateral, tetrahedral, 
and hexahedral meshes. To obtain good load balancing 
across computational nodes, METIS [1] is applied for 
domain decomposition. METIS is a set of programs for 
partitioning graphs and finite element meshes, and for 
producing fill-reducing orderings for sparse matrices. 
The algorithms implemented in METIS are based on 
multilevel graph partitioning schemes. The key features 
of METIS include extremely fast partition, high quality 
partitions, and low fill orderings. The parallel 
construction of UNCLE is based on message passing 
interface (MPI) protocols and has worked successfully 
on systems ranging from commodity PC clusters up to 
traditional supercomputers. In order to demonstrate the 
accuracy and performance of center pressure based 
method, several test cases are presented for validation 
such as two-dimensional incompressible flow past a flat 
plate, two/three-dimensional driven cavity flow, and 
two/three-dimensional flow over a circular cylinder. 

2 Numerical Methods 

A center pressure based method for two/three- 
dimensional finite volume unstructured incompressible 
Navier-Stokes solver for steady /unsteady flow fields is 
presented in this paper. It is center pressure based 
SIMPLE algorithm with second order accuracy in both 
time and space. In order to compute numerical flux on 
interfaces, a second order upwind scheme is adopted to 
compute advection terms and second order central 
difference scheme is used for diffusion terms. Non- 
staggered grids with the Rhie and Chow momentum 
interpolation method [2] is employed to correct the 
checkerboard solution in the SIMPLE scheme. 


2.1 Governing equations 


The governing equations for unsteady incompressible 
viscous flow under the assumption of no body force and 
heat transfer are as below. 


Conservation of Mass 
jpdV = -(^pU'U'dS 


( 1 ) 


Conservation of Momentum 

— JpUjdV = -<j* pu t n t UjdS - ^ pn } dS + r tJ n \ dS @) 
dt v 

Conservation of Energy 

— jp EdV - -<j* pu t n t EdS - ^ pu } UjdS + u } r y n t dS^ 

dt y 

where p is density, p is pressure, u, is component of 
velocity vector, n, is unit normal vector of the interface, 
Ty is tensor of shear force, and specific internal energy 
E = e + i( w 2 -I- v 2 + w 2 ) • Notably, density is constant for 

incompressible flow. 


2.2 Convective and diffusive fluxes 


Figure 1(a) shows the schematic diagram for integration 
area for convective fluxes. By using Taylor series 
expansion, flow properties on the interface can be 
obtained by Eq. (3). 


(*/-**) 




(z f - Zp ^ ) + HOT 




(3) 





( Z / 


-Zp 2 ) + HOT 


where <j> stands for the velocity components and 
pressure, the superscript RHS and LHS denote the 
approximation from the right-hand side and left-hand 
side of the interface respectively, and HOT represents 
higher order terms. By substituting Eq. (3) into Eq. (4), 
interfacial flow properties ^can be obtained. 


*,=}(**“ ++ UB )-\stgnO,mM ms -+“*) ( 4 ) 

The gradients at the nodal points (cell centers) are 
evaluated by the Gauss’s divergence theorem as below. 

\—dV = fyn,dA 

V A / c \ 


dx, V 

where A^ ce is the total number of interfaces of the cell 
and V denotes the volume of the control volume cell. 

The schematic diagram for diffusive fluxes is shown in 

Fig 1(b). The gradients at the interface can be evaluated 

by using Chain rule as Eq. (6). 

^_d^d^ + d^_dri^d^_dC_ 

dx d£ dx dij dx d£ dx 

ei = eidi d^_drj_ et_d£_ 

dy d£ dy drj dy d£ dy 

d^_ = d^df_ df>_dn_ ctydC_ 

dz~ dz + drj dz+ d£ dz 

where the local coordinate system (g,rj,£) is defined by 
the type of mesh separately. 

For triangular mesh in Fig. 1(b), £ is the vector form 
nodal point Pi to P 2 , Tj is the vector from vertex V] to 
V 2 , and Q is the integration area for diffusive fluxes. 
The diffusive fluxes can be approximated by Eq. (7). 

(I* ) ~ ^k [( ^ " ^ >( A - y* ) - (A- - )(>’.; - y ,- )] 

^ ^ ~ — &P, (0r 2 ~ )( x r 2 ~ x l[ )] 

(7) 

where fo, denotes the properties at nodal points and 
denotes the properties at vertices. 

The values of vortices are obtained by averaging 
surrounding nodal value, in which inverse distances 
from all surrounding nodal points are considered as 
weighted function. 

2.3 Center pressure based SIMPLE algorithm 

By using an initial pressure field, P”, we can obtain u n , 
v", and w n by solving the momentum equations in an 
uncoupled form. The momentum equations can be 
written in the form as Eq. (8). 
a c Au = Y j a„ b Au + RHS u 

rib 

a c A\- = Y 4 ^„ h Av+ RHS V 

nb 

a c Aw = XXi Au ' + RHS - 

nb 

where the coefficients a nb and 


( 8 ) 


j c are 


, , Zm,* A t 

C Y _ _k=\ 


( 11 ) 


a„b = max(-OTy,0) + n f (£ x n^ + 4 y n 2 +£,n i )A , 

^=Z o -*’ 

nb 

and the RHS term can be written as 
RHS . =-Y\m,u: -(r" - p" ), n,,A, - (r 2 n , ), n u 4 - (r” ), n,,A, ] 

i=l 

= ~ - ( r il)/ n i.l 4 — ( r 2 l). ” 1 . 2 ^ — Q* K 

-(4 ] n,A - (r’ n - p" ), n, 2 A, - (r" \ n, y A, ] 

i-1 

= “ X l'”« V »" “( r i2 )i n '.l4 ~ ( T 22 ), n i.2^i ~ ( f 32 ) n ,.lA ] “ K 

RHS W = — ^ \rh t w t -(r I3 ). n iX A i — (r 23 ). n i2 A f — (r 33 — p \ n j2 A, ] 

r=i 

= -SK w ” -kUA -(r»b'.2 A i -( r n), n L3 A ,]-^ V c 

where subscript c denotes the cell we are solving, 
subscript nb denotes the neighbor cells, and A denotes 
the interfacial area. In this paper, we solve Eq. (8) by 
using Gauss-Seidel method. Then, we can obtain u\ v , 
and w* by Eq. (9). 
u = u n + Au 

v* = v” + Av ^ 

w* = w" + Aw 

Although at this stage u*, v*, and w * satisfy the 
momentum equations, they do not necessarily satisfy 
the continuity equation. In order to satisfy the mass 
conservation, one has to interpolate the velocity to the 
interface. 

However, this interpolation will lead to the 
checkerboard solutions. In order to avoid the 
checkerboard solutions, one has to allow the interfacial 
velocity to be driven solely by the pressure difference. 
To achieve this aim without sacrificing the accuracy, 
one can divide the interpolated interfacial velocity into 
two components: one is the velocity component without 
the pressure contribution and the other is solely the 
pressure contribution. The former, which is at the cell 
center, can be written as: 


u f =u f - 


dx 

dpT_ 

dy 






f a f 


. _ ( 3p n y f 


Where V f and af are obtained by interpolation to the 
interface. 

We further assume that there are corrections to u/ , v/, 
and w/, such that the continuity equation can be 
satisfied by using Eq. (12). 

+ Au' f )rtj +(v^ + Av^ )n 2 + (w* f + Aw' )n 3 ]A = 0 

(12) 

We can rewrite Eq. (12) as 

N f- 

X P[ A “/" 1 +A v' f n 2 + w)n , = piu' f n , +v’ f n 2 + w)n 2 ]A 

,=i »=1 

(13) 

where the right-hand side in Eq. (13) represents the 
mass imbalance in the control volume cell. 

One assumes there is a corresponding pressure 
correction field, p\ which drives the velocity 
corrections according to: 

A"/ ZAPk-Pp,) 

\dx) f a f a, 


Av > — (%-] 

{dy) / a / a, 

( dp'\ y f v r r , , . 

Aw / * H ■£■ ~ Pn ) 

\dz ) f a f a f 


(14) 


// “/ “/ 

By substituting the velocity correction equations into 
the equation for the mass imbalance, we can obtain the 
equations of the pressure correction: 

<>cP'c=Yj a ’*p''* +b (15) 

nb 

were a nb and a c in the continuity equation are 

°nb = Pt— £r"l + — £,« 2 + — ’ 


• fy" K 

= u +— — - 


II 

M 

& 

dx 0 


nb 

11 

+ 

•P 1^ - 

(10) 

and 


b = p{u f n, +v)n 2 + w)n 2 ]A ■ 

nb 

. dv V, 


Once the pressure correction 

II 

4 

l- 


update the pressure field by: 


The latter is obtained directly from the pressure 
difference of the two adjacent nodal points, P } and P 2 
such that the interfacial velocity can be expressed as: 


p n " = p"+ a p p' 


(16) 


where Op is the under-relaxation factor for pressure and 
is generally with a value of 0.5-0. 8. Then the velocity 


correction on the interfaces as well as nodal points will 
be updated. 

2.4 Partitioning approach 

Figure 2 shows the schematic diagram of partitioning 
approach for center pressure based method, which is 
also implemented into UNCLE. In Fig. 2, blue points 
indicate vertices, red points indicate nodal points, and 
white points indicate the boundary points. By using this 
approach, the control volumes on the boundary are not 
split. Only communication of nodal values is needed for 
parallel computation which makes the implement of 
MP1 in an unstructured grid more straightforward. 

Excellent load balancing between the subgrids on 
each node is achieved through using METIS for domain 
decomposition. METIS can partition an unstructured 
grid into any integer number of zones without losing 
load balance. It is compatible with many platforms, 
convenient for running CFD codes on a variety of 
supercomputer to cluster architectures. Present 
partitioning approach has been tested by a number of 
two/three-dimensional geometries. All results show 
good load balances. In order to demonstrate the 
capability of this partitioning approach, this paper will 
present two cases — one is the grid for two-dimensional 
flow over circular cylinder in triangular mesh, and the 
other is the grid for three-dimensional flow over a 
circular cylinder in tetrahedron mesh. The definition of 
load-imbalance rate L IMB and load-balance rate L B in 
this paper is defined as Eq. (17) and (18). 

Lms = K^y xl00 % o ? ) 

N 

tn'g 

( 18 > 

where N no j e is grid size of the node and N ^ is the 
average grid size. By using load balance rate, we can 
compare the load balance quantitatively. 

Figure 3(a) shows the partitioned grid for 2D flow over 
a circular cylinder in triangular mesh. The number of 
total grid points is 5 1 ,363 and the number of total cells 
is approximately 0.1M. The grid is partitioned to 16 
zones for parallel computation. The cell distribution is 
not uniform, denser near the cylinder and coarser away. 
The load-balance distribution on each node is shown in 
Fig. 3(b). The x-axis indicates the node number and the 
y-axis indicates load-balance rate. The resulting load- 
balance rates are very close to 100% on every node 
with an average load balance rate of 98.37%. Figure 
3(c) shows a partitioned grid for three-dimensional flow 
over circular cylinder with the internal grid distribution 
visible in a cut-away. The number of total grid points is 
approximately 0.3M and the number of total cells is 
approximately 1.3M. In this case, the grid is partitioned 
to 32 zones. The average load-balance rate is 97.9%. 
Our test results show that present partitioning approach 


has excellent load balance in two/three-dimensional 
grids with various types of meshes by using METIS for 
domain decomposition. 

2.5 Time discretization 

In this paper, a second-order fully implicit scheme is 
employed for the temporal discretization. In here, we 
take a one-dimensional equation example: 

3<r’ (19) 

2A t dx 

where </> is primitive variable, / is interfacial flux, and 
the superscript n indicates the index in time. A deferred 
iterative algorithm is employed to obtain ft* 1 by 
substituting (20) into (19), 

) W+I = ( f +1 ) m + (A <f>) m (20) 

where the subscript m stands for the sub iteration level. 
The final equation is 
3(A t) m df(W 

2 A/ ar (21) 

w-r ] ) 3((<rT -f) tmr'D 

2At 2A t dx 

The right-hand-side of Eq. (21) is explicit and can be 
implemented in a straightforward manner to discretize 
the spatial derivative term. The deferred iterative 
algorithm is strongly stable, and the solution fl* 1 is 
obtained by using inner iterations to reach the 
convergent solution of the right-hand-side of Eq. (21), 
which means A<j> is approximate to zero. A sub-iteration 
is performed at every time step so that this method is 
fully implicit. 

3 Results 

3.1 Two-dimensional laminar incompressible flow past 
a flat plate 

In this case, two-dimensional laminar incompressible 
flow past a flat plate is simulated. The schematic 
diagram including geometry and boundary conditions is 
shown in Fig. 4. A uniform free stream boundary 
condition is imposed on the inlet. Because of the 
viscous effect, the laminar boundary layer begins to 
grow at the position x=0 on the plate. The initial 
condition for the entire computational domain is 
uniform free stream. A quadrilateral mesh of 600x200 
is used for both Reynolds number (Re) at 5,000 and 
50,000 based on a non-dimensional length scale of 
unity. The minimum distance from the wall is 5x1 0' 5 
(corresponding to y + =0.1) which is fine enough to 
capture the phenomena within the boundary layer. In 
this case, the computational domain is divided into 16 
zones. The parallel computation is performed on 
KFC3a, a 16 nodes PC cluster with Pentium IV 2.4 


GHz CPU and gigabit network developed at the 
University of Kentucky. The results of this computation 
are compared to the standard Blasius solution for a 
laminar flat-plate boundary layer. Figure 5(a) shows the 
rj = yy [R^/ x versus w/U plot at x=4.5 from present 

results for Re=5,000 and 50,000 in comparison with 
Blasius solution. Both present results are good 
agreement with Blasius solution. Figure 5(b) shows the 
plot of momentum thickness (Re e ) versus coefficient of 
skin friction ( c f ) with present results and Blasius 
solution. Good agreement is obtained by comparing our 
present results with Blasius solution. 

3.2 Two-dimensional driven cavity’ flow 

In this section, two-dimensional incompressible flow in 
a square cavity at a Reynolds number of 400 is 
simulated. The fluid in the cavity is driven by a moving 
top with constant speed. Because driven cavity flow 
lacks an exact solution, an existing accurate numerical 
solution for this problem is used as a benchmark for 
comparing our results. Ghia et al. [3] presented 
numerical studies using the vorticity-stream function 
formulation for solutions up to Re= 10,000 with 
257x257 grid points, and these simulation results have 
been widely used as a benchmark for the driven cavity 
problem. The schematic diagram of this case with 
geometry and boundary conditions is shown in Fig. 6. 
The initial condition for the entire computational 
domain is stationary everywhere. In order to compare 
Ghia’s results, the number of grid points used is 
257x257 or 66,049 and 65,536 cells are used in a 
quadrilateral mesh. For a triangular mesh, 66,546 cells, 
which is approximately the same as quadrilateral mesh, 
and 33,618 grid points are used in our computation. 
Figure 7(a) shows the u-velocity profile along the 
horizontal center line for both present results with 
quadrilateral and triangular mesh and Ghia’s result. 
Both present results are in good agreement with Ghia’s 
result. It also shows the present solution is identical and 
is independent of mesh types. Figure 7(b) shows the u- 
velocity profile along the horizontal center line; again, 
the results match. Figure 8 presents the u- and v- 
velocity contours and streamline plot from the present 
results on the quadrilateral mesh. In Fig. 8(a), the u- 
velocity contours ranges from -0.33 to 1.0 with 100 
intervals. Solid lines indicate the positive values and 
dashed lines negative ones. In Fig. 8(b), the v-velocity 
contours ranges from 0.64 to 0.31 with 100 intervals. 
The flow structures including the location of the major 
vortex center, the bubble in the right bottom comer, and 
a small bubble in the left bottom comer are shown 
clearly in Figure 8(c), and are in good agreement with 
the results of Ghia. 


3.3 Three-dimensional driven cavity flow 

The three-dimensional version of the preceding 
problem is also a standard case for a new flow solver. 
In 1987, Ku et al. [4] simulated three-dimensional flow 
in a cubic cavity by using pseudospectral methods to 
solve the Navier-Stokes equations for Re = 100, 400, 
and 1000. In 2003, Shu et al. [5] repeated this problem 
by using the SIMPLE algorithm with the differential 
quadrature (DQ) method. They simulated the three- 
dimensional driven cavity flow at Re = 100, 200, 400 
and 1000 and compared their results with Ku’s results. 
In this section, we simulated this problem at Re = 400 
and compared our results with those of Ku and Shu. In 
order to validate the results with different meshes, two 
grids are used to study this problem. One is a 
hexahedral mesh with 67x67x67 grid points and 
287,496 cells, and the other is a tetrahedral mesh with 
79,951 grid points and 446,953 cells. The geometry of 
this problem is a unit cube. The boundary condition at 
the y = 1 plane is uniform flow with u=l, v=0, and 
w=0, and all other boundary conditions are no-slip 
walls. The initial condition for the entire computational 
domain is stationary. Figure 9(a) shows the u-velocity 
profile at the horizontal centerline of the z = 0.5 plane 
for the present hexahedral and tetrahedral mesh results 
as well as those of Ku and Shu. Figure 9(b) shows the 
v-velocity profile at the vertical centerline of z = 0.5 for 
the same set of simulations. Both present simulations 
show essentially identical solutions, and both are in 
good agreement Ku and Shu. Figure 10 shows the 
velocity vectors and pressure contours taken from the 
present simulation using the hexahedral mesh on the 
z=0.5, x=0.5, and y=0.5 planes respectively. As before, 
for the pressure contours dashed lines mean negative 
values. The established flow structures including the 
locations of the vortex center and the positive and 
negative regions in each plane are clearly visible and in 
good agreement with Ku’s and Shu’s results. 

3.4 TwoAThree-dimensional incompressible flow over a 
circular cylinder 

Flow over a circular cylinder is a standard unsteady test 
problem. Many two-dimensional numerical studies 
have been done on this flow. In 1990, Rogers and Kwak 
[6] studied flow over a circular cylinder for Re = 200 
using an upwind differencing scheme to solve the 
incompressible Navier-Stokes equations. They 
compared their results, which were obtained using 3rd 
order and 5th order upwind differencing schemes, with 
computational and experimental results. In 1995, Ku [7] 
studied this problem for Reynolds numbers 100, 250, 
500, and 1,000 using a pseudospectral element method. 
Alonso et al. [8] studied the vortex shedding over a 
circular cylinder at Re = 500 with a multigrid unsteady 
Navier-Stokes solver with a flux-limited dissipation 


scheme in 1995. In 1998, Liu et al. [9] used 
preconditioned multigrid methods to study unsteady 
incompressible flows. They investigated flow over a 
circular cylinder at Re = 200 and their results was in 
good agreement with other computational and 
experimental results. In 1999, Ronald et al. [10] 
reported their results for this problem at Re = 300 using 
the NASA FUN2D code. A final example of two- 
dimensional cylinder flow simulation is the work of 
Qian and Vezza [11] studied flow over a circular 
cylinder problem at Re = 1000 with a vorticity-based 
method in 2001. 

Examples of three-dimensional numerical studies of 
flow over a circular cylinder are the studies of Braza 
and Persillon [12] and Henderson [13]. There are also 
many experimental studies of flow over cylinder 
problems such as Roshko [14], Wille [15], and 
Williamson [16], [17]. According to Williamson [16], 
the phenomena of flow over a circular cylinder can be 
classified by Reynolds number. Williamson found the 
laminar vortex shedding region to occur for Reynolds 
numbers between 49 and 140-194. The three- 
dimensional wake transition region occurs for Re ~ 190 
to 260. For the range between Re= 260 - 1000, the 
three-dimensional disorder of the wake begin to 
increase in at the fine scales. This evolves into the 
shear-layer transition regime for Reynolds numbers 
1 ,000 up to 20,000. He also noted that three- 
dimensional effects occur for Re > 190. Further detailed 
explanations of the remaining regimes are reported in 
Williamson [16]. To date, most numerical studies about 
this problem only focused on a single Reynolds number 
and either two-dimensional or three-dimensional 
simulations exclusively. The current results encompass 
Reynolds numbers 100, 200, 300, 500, 1,000, and 1500 
for two-dimensional simulations and 100 and 200 for 
three-dimensional simulations. These results are 
compared with the appropriate previous studies as well 
as current results cross-comparisons to examine the 
dimensional and flow regime effects. 

Figure 11 shows a schematic diagram for flow over a 
circular cylinder with dimensions and boundary 
conditions. For three-dimensional simulation, the span 
of the cylinder is 10D, with D representing the 
reference length equal to the diameter of the cylinder. 
The boundary conditions at z = 0 and z = -10D are 
periodic, eliminating end effects. The initial condition 
for the entire domain is uniform flow as inflow for all 
simulations and the time step is 0.005 for all cases. The 
grid for two-dimensional simulations is a quadrilateral 
mesh with 22705 cells and 22925 grid points; for three- 
dimensional simulation, a hexahedral mesh with 1.1 3M 
cells and 1.1 7M grid points is used. Both grids are 
densely distributed near the cylinder and wake region 
and coarser near the outer region. All of the two- 
dimensional simulations are performed on the KFC3a 


cluster with 1 6 nodes, and all three-dimensional 
simulations are performed on KFC2, a 48 node 
commodity cluster with AMD Athlon 2000+ CPU and 
a channel-bonded network. It is noted that each three- 
dimensional simulation took approximately one week 
with 32 nodes on KFC2. Figure 12 presents the 
coefficient of Lift (Q) and the coefficient of drag ( C/> ) 
unsteady histories of the present two-dimensional 
simulations for Re = 100, 200, 300, and 1,000 
respectively. The Strouhal number ( S ,) for these data 
sets is derived from the frequency of Q. In our 
simulations, higher Reynolds number cases achieve 
steady Strouhal numbers faster than lower Re cases, 
consistent with other computational results. Table 1 
presents the summary of our present two- and three- 
dimensional results along with other computational and 
experimental results. As shown in Table 1, our current 
results show good agreement with previous data by 
comparing S t9 Q, and C D . Fig. 13 shows the two- 
dimensional vorticity contours for Re = 100, 200, 300 
and 1000. In Fig. 13, the contours are range from -0.3 
to 0.3. The three-dimensional co 2 contours from the 
current simulations are similar to those generated by the 
two dimensional test cases. As seen in Fig. 13, the 
vortex shedding frequency increases as Re increases. 
For Re = 100, the vortices decay in the downstream. 
Because of the limited computational domain, we do 
not see the vortex structures merge to large scale 
vortices in our simulation. For Re = 200, 500 and 1000, 
not only does the vortex strength decay, but also the 
vortex structures collapse in the far downstream and 
start to merge to large scale vortices. This same 
phenomenon is reported by Inoue and Yamazaki [18]. 
Figure 14(a)-(c) show the present three-dimensional 
results of co x , co y and co z for Re = 100 and (d)-(f) for Re 
= 200 in the y = 0 plane. The contours of co x and co y 
range from -0.0005 to 0.0005 for Re = 100 and from - 
0.02 to 0.02 for Re = 200, and the contours of co z range 
from -0.3 to 0.3 for both cases where bright regions 
correspond to positive values and dark regions 
correspond to the negative values. Obviously, the 
magnitude of co z is much larger than the magnitudes of 
co x and co y . For Re = 100, we can see that the magnitude 
of co x decreases in the upstream and the local minimum 
appears at the position of 5th “roll” in Fig. 14(a). After 
that, the magnitude of co x begins to grow. In Fig. 14(b), 
the magnitude of co y grows after the flow past the 
cylinder. The thickness of the vortex “roll” increases 
downstream. From Fig. 14(c), the magnitude of co z 
decays after the flow past the cylinder due to the energy 
dissipation. Because the three-dimensional effects are 
very weak at Re = 200, the three-dimensional outcome 
is very similar to two-dimensional case. For Re = 200, 
as seen from Fig. 14(d), the magnitude of w x decreases 
in the beginning and the local minimum also appears 


near the 5th “roll”. In the far downstream, a transition 
zone is observed, after which the vortex scales transit 
from small to large scales. In Fig. 14(e), unlike Fig. 
14(b), wavy vortex structures are observed. Figure 14(f) 
shows that co z decays after the flow past the cylinder. 
Figure 15 shows the iso-surface of vorticity magnitude 
for Re = 200 from the present three-dimensional 
simulation result. The flow structures, especially the 
vortex streets in the wake regions, observed in our 
present results are in good agreement with other 
simulation results. 

4 Conclusions 

A center pressure based method is presented in this 
paper, and which has been implemented successfully to 
a new two/three-dimensional parallel unstructured 
incompressible Navier-Stokes solver, UNCLE, which 
has been developed at University of Kentucky. 
Implementation of using different types of meshes with 
center pressure based method is feasible and 
straightforward. In order to increase flexibility in 
complex geometries, center pressure based method has 
been extended to use a variety of grid types, such as 
triangular, quadrilateral, tetrahedral, and hexahedral 
meshes. Mesh independent tests are also made to prove 
that current method can generate identical solutions for 
different mesh types. By using METIS for domain 
decomposition, excellent parallel load balance is 
achieved. In this paper, several test cases are 
presented — laminar incompressible flow past a flat 
plate, steady two/three-dimensional driven cavity flow, 
and unsteady two/three-dimensional flow over a 
circular cylinder for low Reynolds numbers. All these 
test cases yielded good agreements in comparison with 
previous computational or experimental results. A 
complete qualitative and quantitative study of two- 
dimensional flow over a circular cylinder for low 
Reynolds number is presented in this paper, which can 
be further used as a benchmark solution set for the 
development of new unsteady Navier-stokes solvers. 
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Fig 5(a) Plot of 77 versus w/U at x=4.5 with present Fig. 7(a) The u-velocity profile along the horizontal 
results and Blasius solution center line for present results and Ghia’s result. 



Fig. 5(b) The plot of Re e versus cywith present results 
and Blasius solution 



Fig. 6 The schematic diagram of two-dimensional 
driven cavity flow with boundary conditions 



Fig. 7(b) The v-velocity profile along the vertical center 
line for present results and Ghia’s result. 



Fig. 8(a) The u-velocity contour plot for two- 
dimensional driven cavity flow at Re=400. 









Table 1 Summary of present results and other computational and experimental results 


Re 

100 

200 

300 

500 

1000 

1500 

Present 
2D results 

C L =±0.314 

C d =1.325±0.008 

S,=0.165 

C l =±0.642 

C d =1.318±0.04 

S,=0.196 

Ci.=±0.869 
C D =1.354±0.072 
S,=0.2 1 

C L =± 1.115 
C D =1 406±0.119 
S, =0.224 

C l =±1.378 

C d =1.489±0.198 

S,=0.239 

C L =± 1.553 

C D =1.575±0.247 

S,=0.246 

Present 
3D results 

C l =±0.322 

Cd=1.327±0.009 

S,=0.164 

C l =±0.664 

C d =1.324±0.042 

S,=0.195 





Computational 2D results I 

Rogers and 

Kwak [6] 

(5 th order) 


C l =±0.65 

C D =1.23i0.05 

S,=0.185 





Alonso et al. 
[7] 




C L =±1 046 
C D =1 217 
Si=0 224 



Ku [8] 

C L =±0 228 
C D = 1. 33-1.358 
S, =0.1675 



C L =±1 03 
Cd=1 212-1.481 
S, =0.2203 

C L =± 1242 
C D = 1 . 1 87— 1 .65 1 
S, =0.2326 


Liu et al. [9] 


C l =±0.69 

Cd=1.31±0.049 

S,=0.192 





Ronald et al. 
[10] 







Qian and Vezza 
[11] 





C D =1.52 

S,=0.24 


I Computational 3D results 

Braza and 

Persillon [12] 



S, =0.202 




j He ’, 7 . , 



wmmm* 





Roshko [14] 


S,=0. 1 9 





Wille [15] 




u 

■■■■ 


BRH 

S,=0.166 






mwmm 


S,=0.197 






49 










